On Unique Decodability

Marco Dalai, Riccardo Leonardi 



Abstract — In this paper we propose a revisitation of the topic 
of unique decodability and of some fundamental theorems of 
lossless coding. It is widely believed that, for any discrete source 
$\mathbf{X}$, every "uniquely decodable" block code satisfies

$$E[l(X_1 X_2 \cdots X_n)] \ge H(X_1, X_2, \ldots, X_n),$$

where $X_1, X_2, \ldots, X_n$ are the first $n$ symbols of the source,
$E[l(X_1 X_2 \cdots X_n)]$ is the expected length of the code for those
symbols and $H(X_1, X_2, \ldots, X_n)$ is their joint entropy. We show
that, for certain sources with memory, the above inequality only 
holds when a limiting definition of "uniquely decodable code" is 
considered. In particular, the above inequality is usually assumed 
to hold for any "practical code" due to a debatable application 
of McMillan's theorem to sources with memory. We thus propose 
a clarification of the topic, also providing an extended version of 
McMillan's theorem to be used for Markovian sources. 

Index Terms — Lossless source coding, McMillan's theorem, 
constrained sources, minimum expected code length. 
I. Introduction

The problem of lossless encoding of information sources has
been intensively studied over the years (see [1, Sec. II] for a
detailed historical overview of the key developments in this
field). Shannon initiated the mathematical formulation of the
problem in his major work [2] and provided the first results
on the average number of bits per source symbol that must
be used asymptotically in order to represent an information
source.

For a random variable $X$ with alphabet $\mathcal{X}$ and probability
mass function $p_X(\cdot)$, he defined the entropy of $X$ as the
quantity

$$H(X) = \sum_{x \in \mathcal{X}} p_X(x) \log \frac{1}{p_X(x)}.$$

On the other hand, Shannon focused his attention on finite state
Markov sources $\mathbf{X} = \{X_1, X_2, \ldots\}$, for which he defined the
entropy as

$$H(\mathbf{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n),$$

a quantity that is now usually called entropy rate of the source. 
Based on these definitions, he derived the fundamental results 
for fixed length and variable length codes. In particular, he 
showed that, by encoding sufficiently large blocks of symbols, 
the average number of bits per symbol used by fixed length 
codes can be made as close as desired to the entropy rate of 
the source while maintaining the probability of error as small 
as desirable. If variable length codes are allowed, furthermore, 
he showed that the probability of error can be reduced to zero 
without increasing the asymptotically achievable average rate. 
Shannon also proved the converse theorem for the case of fixed 
length codes, but he did not explicitly consider the converse 
theorem for variable length codes (see [1, Sec. II.C]).

The authors are with the Department of Electronics for Automation,
University of Brescia, via Branze 38 - 25123, Brescia, Italy.
Email: {marco.dalai, riccardo.leonardi}@ing.unibs.it



An important contribution in this direction came from
McMillan [3], who showed that every "uniquely decodable"
code using a $D$-ary alphabet must satisfy Kraft's inequality,
$\sum_i D^{-l_i} \le 1$, $l_i$ being the codeword lengths [4]. Based on
this result, he was able to prove that the expected length of
a uniquely decodable code for a random variable $X$ is not
smaller than its entropy, $E[l(X)] \ge H(X)$. This represents a
strong converse result in coding theory. However, while the
initial work by Shannon explicitly referred to finite state
Markov sources, McMillan's results basically considered only
the encoding of a random variable. This leads to immediate
conclusions on the problem of encoding memoryless sources,
but an ad hoc study is necessary for the case of sources with
memory. The application of McMillan's theorem to this type
of source can be found in [5, Sec. 5.4] and [6, Sec. 3.5]. In
these two well-known references, McMillan's result is used not
only to derive a converse theorem on the asymptotic average
number of bits per symbol needed to represent an information
source, but also to deduce a non-asymptotic strong converse to
the coding theorem. In particular, the famous result obtained
(see [6, Th. 3.5.2], [5, Th. 5.4.2], [7, Sec. II, p. 2047]) is that,
for every source with memory, any uniquely decodable code
satisfies

$$E[l(X_1 X_2 \cdots X_n)] \ge H(X_1, X_2, \ldots, X_n), \qquad (1)$$

where $X_1, X_2, \ldots, X_n$ are the first $n$ symbols of the source,
$E[l(X_1 X_2 \cdots X_n)]$ is the expected length of the code for
those symbols and $H(X_1, X_2, \ldots, X_n)$ represents their joint
entropy.

In this paper we want to clarify that the above inequality is
only valid if a limiting definition of "uniquely decodable code"
is assumed. In particular, we show that there are information
sources for which a reversible encoding operation exists that
produces a code for which inequality (1) fails for every $n$.
This is demonstrated through a simple
example in Section II. In Section III we revisit the topic of
unique decodability, consequently providing an extension of
McMillan's theorem for the case of first order Markov sources.
Finally, in Section IV, some additional interesting remarks on
the considered topic are made.

II. A Meaningful Example 

Let $\mathbf{X} = \{X_1, X_2, \ldots\}$ be a first order Markov source with
alphabet $\mathcal{X} = \{A, B, C, D\}$ and with transition probabilities
shown by the graph of Fig. 1. Its transition probability matrix
is thus

$$P = \begin{pmatrix}
1/2 & 0 & 1/2 & 0 \\
0 & 1/2 & 0 & 1/2 \\
1/4 & 1/4 & 1/4 & 1/4 \\
1/4 & 1/4 & 1/4 & 1/4
\end{pmatrix},$$

where rows and columns are associated to the natural alphabetical
order of the symbol values A, B, C and D.

It is not difficult to verify that the stationary distribution 
associated with this transition probability matrix is the uniform 
distribution. Let $X_1$ be uniformly distributed, so that the source
$\mathbf{X}$ is stationary and, in addition, ergodic.
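The stationarity claim can be checked mechanically. The following is a minimal sketch, not part of the original derivation; it assumes that the zero entries of the transition matrix sit at the forbidden transitions A→B, A→D, B→A and B→C, as implied by the decoding rule discussed later in this section. The helper name `step` is ours.

```python
from fractions import Fraction as F

# Transition matrix of the example source; rows/columns ordered A, B, C, D.
# Assumption: zeros mark the forbidden transitions (A->B, A->D, B->A, B->C).
P = [
    [F(1, 2), F(0),    F(1, 2), F(0)],
    [F(0),    F(1, 2), F(0),    F(1, 2)],
    [F(1, 4), F(1, 4), F(1, 4), F(1, 4)],
    [F(1, 4), F(1, 4), F(1, 4), F(1, 4)],
]

def step(dist, P):
    """One step of the chain: returns the row vector dist * P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

uniform = [F(1, 4)] * 4
assert all(sum(row) == 1 for row in P)   # every row is a probability vector
assert step(uniform, P) == uniform       # the uniform distribution is invariant
```

Exact fractions are used so that the invariance is verified as an identity rather than up to rounding.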

Fig. 1. Graph, with transition probabilities, for the Markov source used in
the example.

Let us now examine possible binary encoding techniques
for this source and possibly find an optimal one. In order to
evaluate the performance of different codes we determine the
entropy of the sequences of symbols that can be produced by
this source. By stationarity of the source, one easily proves
that

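$$H(X_1, X_2, \ldots, X_n) = H(X_1) + \sum_{i=2}^{n} H(X_i | X_{i-1}) = 2 + \frac{3}{2}(n-1),$$

where $H(X_i | X_{i-1})$ is the conditional entropy of $X_i$ given
$X_{i-1}$, that is

$$H(X_i | X_{i-1}) = \sum_{x, y \in \mathcal{X}} p_{X_i X_{i-1}}(x, y) \log \frac{1}{p_{X_i | X_{i-1}}(x | y)}.$$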
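As a numerical cross-check of the closed form $2 + \frac{3}{2}(n-1)$, the joint entropy can also be computed by brute force over all $4^n$ sequences. The sketch below is only illustrative; the dictionary `P` encodes the transition probabilities under the assumption that A is followed only by A or C and B only by B or D, and the function names are ours.

```python
from itertools import product
from math import log2

SYMBOLS = "ABCD"
# Assumed transition probabilities: P[prev][cur]; missing entries are zero.
P = {
    "A": {"A": 0.5, "C": 0.5},
    "B": {"B": 0.5, "D": 0.5},
    "C": {s: 0.25 for s in SYMBOLS},
    "D": {s: 0.25 for s in SYMBOLS},
}

def sequence_probability(seq):
    """Probability of seq when X_1 is uniform and the chain follows P."""
    p = 0.25
    for prev, cur in zip(seq, seq[1:]):
        p *= P[prev].get(cur, 0.0)
    return p

def joint_entropy(n):
    """H(X_1, ..., X_n) in bits, summing over all 4**n sequences."""
    total = 0.0
    for seq in product(SYMBOLS, repeat=n):
        p = sequence_probability(seq)
        if p > 0:
            total -= p * log2(p)
    return total

for n in range(1, 7):
    closed_form = 2 + 1.5 * (n - 1)
    assert abs(joint_entropy(n) - closed_form) < 1e-9
```

The enumeration is exponential in $n$ and is meant only for small values.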



Let us now consider the following binary codes to represent 
sequences produced by this source. 

Classic code

We call this first code "classic" as it is the most natural
way to encode the source given its particular structure. Since
the first symbol is uniformly distributed among four choices,
2 bits are used to uniquely identify it, in an obvious way.
For the next symbols we note that we always have dyadic
conditional probabilities. So, we apply a state-dependent code.
For encoding the $k$-th symbol we use, again in an obvious
way, 1 bit if symbol $k-1$ was an A or a B, and we use 2
bits if symbol $k-1$ was a C or a D. This code seems to
fit the source perfectly, as the number of bits used always
corresponds to the uncertainty. Indeed, the average length of
the code for the first $n$ symbols is given by

$$E[l(X_1, X_2, \ldots, X_n)] = E[l(X_1)] + \sum_{i=2}^{n} E[l(X_i)] = 2 + \frac{3}{2}(n-1).$$

So, the expected number of bits used for the first $n$ symbols is
exactly the same as their entropy, which would let us declare
that this encoding technique is optimal.

Alternative code

Let us consider a different code, obtained by applying the
following fixed mapping from symbols to bits: A → 0, B → 1,
C → 01, D → 10. It is easy to see that this code maps
different sequences of symbols into the same codeword. For
example, the sequences AB and C are both coded to 01. This
is usually expressed, see for example [5], by saying that the
code is not uniquely decodable, an expression which suggests
the idea that the code cannot be inverted, different sequences
being associated to the same code. It is however easy to
notice that, for the source considered in this example, the
code does not introduce any ambiguity. Different sequences
that are producible by the source are in fact mapped into
different codes. Thus it is possible to "decode" any sequence
of bits without ambiguity. For example the code 01 can only be
produced by the single symbol C and not by the sequence AB,
since our source cannot produce such a sequence (the transition
from A to B being impossible). It is not difficult to verify
that it is indeed possible to decode any sequence of bits by
operating in the following way. Consider first the case when
there are still two or more bits to decode. In such a case,
for the first pair of encountered bits, if a 00 (respectively a
11) is observed then clearly this corresponds to an A symbol
followed by a code starting with a 0 (respectively a B symbol
followed by a code starting with a 1). If, instead, a 01 pair
is observed (respectively a 10) then a C must be decoded
(respectively a D). Finally, if there is only one bit left to
decode, say a 0 or a 1, the decoded symbol is respectively an A
or a B. Such coding and decoding operations are summarized
in Table I.

TABLE I
Encoding and decoding operations of the proposed alternative code
for the Markov source of Figure 1.

Encoding
  A → 0
  B → 1
  C → 01
  D → 10

Decoding (two or more bits left)
  00... → A + 0...
  01... → C ...
  10... → D ...
  11... → B + 1...

Decoding (one bit left)
  0 → A
  1 → B

Now, what is the performance of this code? The expected
number of bits in coding the first $n$ symbols is given by:

$$E[l(X_1 X_2 X_3 \cdots X_n)] = \sum_{i=1}^{n} E[l(X_i)] = \frac{3}{2} n.$$

Unexpectedly, the average number of bits used by the code
is strictly smaller than the entropy of the symbols. So, the
performance of this code is better than what would have been
traditionally considered the "optimal" code, that is the classic
code. Let us mention that this code is not only more efficient
on average, but it is at least as efficient as the classic code
for every possible sequence which remains compliant with
the source characteristics. For each source sequence, indeed,
the number of decoded symbols after reading the first $m$ bits
of the alternative code is always larger than or equal to the
number of symbols decoded with the first $m$ bits of the classic
code. Hence, the proposed alternative code is more efficient
than the classic code in all respects. The obtained gain per
symbol obviously goes to zero asymptotically, as imposed by
the Asymptotic Equipartition Property. However, in practical
cases we are usually interested in coding a finite number of
symbols. Thus, this simple example reveals that the problem of
finding an optimal code is not yet well understood for the case
of sources with memory. The obtained results may thus have
interesting consequences not only from a theoretical point of
view, but even for practical purposes in the case of sources
exhibiting constraints imposing high order dependencies.
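The encoding and decoding rules of the alternative code are simple enough to be written down directly. The following Python sketch is one possible rendering of those rules, under the assumption that the only forbidden transitions are A→B, A→D, B→A and B→C; the names `CODE`, `NEXT`, `producible`, `encode` and `decode` are ours. It checks the round trip exhaustively for short sequences, which is a bounded verification and not a proof.

```python
from itertools import product

CODE = {"A": "0", "B": "1", "C": "01", "D": "10"}
# Assumed allowed successors in the example source.
NEXT = {"A": "AC", "B": "BD", "C": "ABCD", "D": "ABCD"}

def producible(seq):
    """True if the source can emit seq (no forbidden transition)."""
    return all(b in NEXT[a] for a, b in zip(seq, seq[1:]))

def encode(seq):
    return "".join(CODE[s] for s in seq)

def decode(bits):
    """Decoding rule of the alternative code: inspect two bits when available."""
    out = []
    i = 0
    while i < len(bits):
        pair = bits[i:i + 2]          # a single bit if only one is left
        if pair == "01":
            out.append("C")
            i += 2
        elif pair == "10":
            out.append("D")
            i += 2
        elif pair in ("00", "0"):
            out.append("A")           # for "00" the second 0 starts the next codeword
            i += 1
        else:                         # "11" or a final "1"
            out.append("B")
            i += 1
    return "".join(out)

# AB and C collide as codeword strings, but AB is not producible.
assert encode("AB") == encode("C") == "01"
assert not producible("AB") and decode("01") == "C"

# Round trip over every producible sequence of up to 8 symbols.
for n in range(1, 9):
    for t in product("ABCD", repeat=n):
        s = "".join(t)
        if producible(s):
            assert decode(encode(s)) == s
```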

Commenting on the "alternative code", one may object that 
it is not fair to use the knowledge on impossible transitions in 
order to design the code. But probably nobody would object 
to the design of what we called the "classic code". Even in 
that case, however, the knowledge that some transitions are 
impossible was used, in order to construct a state-dependent 
"optimal" code. 

It is important to point out that we have just shown a fixed 
to variable length code for a stationary ergodic source that 
maps sequences of n symbols into strings of bits that can 
be decoded and such that the average code length is smaller 
than the entropy of those n symbols. Furthermore, this holds 
for every n, and not for an a priori fixed n. In a sense we could 
say that the given code has a negative redundancy. Note that 
there is a huge difference between the considered setting and 
that of the so-called one-to-one codes (see for example [8] for
details). In the case of one-to-one codes, it is assumed that only
one symbol, or a given known number of symbols, must be
coded, and codes are studied as maps from symbols to binary
strings without considering the decodability of concatenations
of codewords. Under those hypotheses, Wyner [9] first pointed
out that the average codeword length can always be made
lower than the entropy, and different authors have studied
bounds on the expected code length over the years [10], [11].
Here, instead, we have considered a fixed-to-variable length 
code used to compress sequences of symbols of whatever 
length, concatenating the code for the symbols one by one, 
as in the most classic scenario. 

III. Unique decodability for constrained sources 

In this section we briefly survey the literature on unique 
decodability and we then propose an adequate treatment of 
the particular case of constrained sources defined as follows. 

Definition 1: A source $\mathbf{X} = \{X_1, X_2, \ldots\}$ with symbols in
a discrete alphabet $\mathcal{X}$ is a constrained source if there exists a
finite sequence of symbols from $\mathcal{X}$ that cannot be obtained as
output of the source $\mathbf{X}$.

A. Classic definitions and revisitation 

It is interesting to consider how the topic of unique decod- 
ability has been historically dealt with in the literature and 
how the results on unique decodability are used to deduce 
results on the expected length of codes. Taking [6] and [5] as
representative references for what can be viewed as the classic 
approach to lossless source coding, we note some common 



structures between them in the development of the theory, but 
also some interesting differences. The most important fact to 
be noticed is the use, in both references with only marginal 
differences, of the following chain of deductions: 

(a) McMillan's theorem asserts that all uniquely decodable
codes satisfy Kraft's inequality;

(b) If a code for a random variable $X$ satisfies Kraft's
inequality, then $E[l(X)] \ge H(X)$;

(c) Thus any uniquely decodable code for a random variable
$X$ satisfies $E[l(X)] \ge H(X)$;

(d) For sources with memory, by considering sequences of $n$
symbols as super-symbols, we deduce that any uniquely
decodable code satisfies $E[l(X_1, X_2, \ldots, X_n)] \ge
H(X_1, X_2, \ldots, X_n)$.
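Step (a) of this chain can be made concrete by evaluating the Kraft sum $\sum_i D^{-l_i}$ for the alternative code of Section II. The sketch below is illustrative only; `kraft_sum` is our helper name, and the fixed two-bit code is a hypothetical reference code introduced for comparison.

```python
from fractions import Fraction

def kraft_sum(lengths, D=2):
    """Left-hand side of Kraft's inequality, sum_i D**(-l_i), as an exact fraction."""
    return sum(Fraction(1, D ** l) for l in lengths)

alternative = {"A": "0", "B": "1", "C": "01", "D": "10"}
fixed_2bit = {"A": "00", "B": "01", "C": "10", "D": "11"}   # hypothetical reference

print(kraft_sum(len(w) for w in alternative.values()))  # 3/2, larger than 1
print(kraft_sum(len(w) for w in fixed_2bit.values()))   # 1
```

Since the alternative code has Kraft sum 3/2, it cannot be uniquely decodable in the sense required by McMillan's theorem, so the chain (a)-(d) says nothing about it.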

In the above flow of deductions there is an implicit
assumption which is not obvious and, in a certain way, not
clearly supported. It is implicitly assumed that the definition of
uniquely decodable code used in McMillan's theorem is also
appropriate for sources with memory. Of course, by definition
of "definition", one can freely choose to define "uniquely
decodable code" in any preferred way. However, as shown
by the code of Table I in the previous section, the definition
of uniquely decodable code used in McMillan's theorem does
not coincide with the intuitive idea of "decodable" for certain
sources with memory. To our knowledge, this ambiguity
has never been reported previously in the literature, and for
this reason it has been erroneously believed that the result
$E[l(X_1, X_2, \ldots, X_n)] \ge H(X_1, X_2, \ldots, X_n)$ holds for every
"practically usable" code. As shown by the Markov source
example presented, this interpretation is incorrect.

In order to better understand the confusion associated with
the meaning of "uniquely decodable code", it is interesting
to focus on a small difference between the formal definitions
given by the authors in [5] and in [6]. We start by rephrasing
for notational convenience the definition given by Cover and
Thomas in [5].

Definition 2: [5, Sec. 5.1, pp. 79-80] A code is
said to be uniquely decodable if no finite sequence
of code symbols can be obtained in two or more
different ways as a concatenation of codewords.

Note that this definition is the same used in McMillan's paper
[3], and it considers a property of the codebook without any
reference to sources. It is however difficult to find a clear 
motivation for such a source independent definition. After all, 
a code is always designed for a given source, not for a given 
alphabet. Indeed, right after giving the formal definitions, the 
authors comment 

"In other words, any encoded string in a uniquely 
decodable code has only one possible source string 
producing it." 

So, a reference to sources is introduced. What is not noticed is 
that the condition given in the formal definition coincides with 
the phrased one only if the source at hand can produce any 
possible combination of symbols as output. Otherwise, the
two definitions are not equivalent, the first one being stronger, 
the second one being instead "more intuitive". 



With respect to formal definitions, Gallager proceeds in a
different way with the following:

Definition 3: [6, Sec. 3.2, p. 45] "A code is
uniquely decodable if for each source sequence of
finite length, the sequence of code letters
corresponding to that source sequence is different from
the sequence of code letters corresponding to any
other source sequence."

Note that this is a formal definition of unique decodability
of a code with respect to a given source. Gallager states
this definition while discussing memoryless sources.¹ In that
case, the definition is clearly equivalent to Definition 2 but,
unfortunately, Gallager implicitly uses Definition 2 instead of
Definition 3 when dealing with sources with memory.²

In order to avoid the above discussed ambiguity, we propose 
to adopt the following explicit definition. 

Definition 4: A code C is said to be uniquely decodable 
for the source X if no two different finite sequences of source 
symbols producible by X have the same code. 
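The difference between Definition 2 and Definition 4 can be observed on the alternative code of Section II with a bounded search. The sketch below looks for two distinct symbol sequences with the same encoding, first among all sequences and then only among those producible by the source. The horizon `MAX_LEN`, the helper names and the successor table `NEXT` (which assumes the forbidden transitions A→B, A→D, B→A, B→C) are ours; a search up to a finite length illustrates the definitions but does not prove unique decodability.

```python
from itertools import product

CODE = {"A": "0", "B": "1", "C": "01", "D": "10"}
NEXT = {"A": "AC", "B": "BD", "C": "ABCD", "D": "ABCD"}   # assumed constraints

def producible(seq):
    return all(b in NEXT[a] for a, b in zip(seq, seq[1:]))

def all_sequences(max_len):
    """All symbol sequences of length 1..max_len, shortest first."""
    for n in range(1, max_len + 1):
        for t in product("ABCD", repeat=n):
            yield "".join(t)

def first_collision(sequences):
    """Return two distinct sequences with the same encoding, or None."""
    seen = {}
    for seq in sequences:
        bits = "".join(CODE[s] for s in seq)
        if bits in seen:
            return seen[bits], seq
        seen[bits] = seq
    return None

MAX_LEN = 8  # finite horizon: a bounded check, not a proof
unrestricted = first_collision(all_sequences(MAX_LEN))
restricted = first_collision(s for s in all_sequences(MAX_LEN) if producible(s))
print(unrestricted)  # ('C', 'AB'): the condition of Definition 2 fails
print(restricted)    # None: no collision among producible sequences up to MAX_LEN
```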

With this definition, not all uniquely decodable codes for
a given source satisfy Kraft's inequality. So, the chain of
deductions (a)-(d) listed at the beginning of this section cannot
be used for constrained sources, as McMillan's theorem uses
Definition 2 of unique decodability.

The alternative code of Table I thus immediately gives:
Lemma 1: There exists at least one source $\mathbf{X}$ and a uniquely
decodable code for $\mathbf{X}$ such that, for every $n \ge 1$,

$$E[l(X_1, X_2, \ldots, X_n)] < H(X_1, X_2, \ldots, X_n).$$
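For the source and code of Section II the gap in Lemma 1 can be evaluated exactly for small $n$. The following sketch computes the expected length of the alternative code by exhaustive enumeration with exact fractions and compares it with the closed-form entropy $2 + \frac{3}{2}(n-1)$ derived earlier. The transition table `P` assumes that A is followed only by A or C and B only by B or D; the function name is ours.

```python
from fractions import Fraction as F
from itertools import product

CODE_LEN = {"A": 1, "B": 1, "C": 2, "D": 2}
P = {   # assumed transition probabilities; missing entries are zero
    "A": {"A": F(1, 2), "C": F(1, 2)},
    "B": {"B": F(1, 2), "D": F(1, 2)},
    "C": {s: F(1, 4) for s in "ABCD"},
    "D": {s: F(1, 4) for s in "ABCD"},
}

def expected_length(n):
    """E[l(X_1, ..., X_n)] for the alternative code, by exhaustive enumeration."""
    total = F(0)
    for seq in product("ABCD", repeat=n):
        p = F(1, 4)                          # X_1 is uniform
        for a, b in zip(seq, seq[1:]):
            p *= P[a].get(b, F(0))
        total += p * sum(CODE_LEN[s] for s in seq)
    return total

for n in range(1, 7):
    entropy = 2 + F(3, 2) * (n - 1)          # closed form from Section II
    assert expected_length(n) == F(3, 2) * n
    assert entropy - expected_length(n) == F(1, 2)   # the gap is half a bit
```

In this example the expected length falls short of the joint entropy by exactly half a bit for each tested $n$.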

B. Extension of McMillan's theorem to Markov sources 

In Section II, the proposed alternative code demonstrates
that McMillan's theorem does not apply in general to codes that
are uniquely decodable for a constrained source $\mathbf{X}$ in the sense of
Definition 4. In this section a modified version of Kraft's
inequality is proposed which represents a necessary condition
for the unique decodability of a code for a first order Markov
source.

Let $\mathbf{X}$ be a Markov source with alphabet $\mathcal{X} =
\{1, 2, \ldots, m\}$ and transition probability matrix $P$. Let $W =
\{w_1, w_2, \ldots, w_m\}$ be a set of $D$-ary codewords for the
alphabet $\mathcal{X}$ and let $l_i = l(w_i)$ be the length of codeword $w_i$.
McMillan's original theorem can be stated in the following
way:

Theorem 1 (McMillan, [3]): If the set of codewords $W$ is
uniquely decodable (in the sense of Definition 2) then

$$\sum_{i=1}^{m} D^{-l_i} \le 1.$$

We propose a modified theorem for considering the unique 
decodability for the specific source. 

¹ See [6, p. 45]: "We also assume, initially, [...] that successive letters are
independent".

² In fact, in [6], the proof of Theorem 3.5.2, on page 58, is based on
Theorem 3.3.1, on page 50, the proof of which states: "...follows from Kraft's
inequality, [...] which is valid for any uniquely decodable code". But Kraft's
inequality is valid for uniquely decodable codes defined as in Definition 2
and not Definition 3.



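Theorem 2: If the set of codewords $W$ is uniquely decodable
for the Markov source $\mathbf{X}$, then the matrix $Q$ defined by

$$Q_{ij} = \begin{cases} D^{-l_j} & \text{if } P_{ij} > 0 \\ 0 & \text{if } P_{ij} = 0 \end{cases}$$

has spectral radius at most 1.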
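To see the condition of Theorem 2 at work, one can build $Q$ for the source and the alternative code of Section II and estimate its spectral radius. The sketch below uses plain power iteration, which is our choice of method and assumes $Q$ is nonnegative and primitive; `build_Q` and `spectral_radius` are our names, and the zero pattern of `P` is the assumed one (A→B, A→D, B→A, B→C forbidden).

```python
def build_Q(P, lengths, D=2):
    """Q[i][j] = D**(-l_j) if P[i][j] > 0, else 0."""
    m = len(P)
    return [[D ** -lengths[j] if P[i][j] > 0 else 0.0 for j in range(m)]
            for i in range(m)]

def spectral_radius(Q, iterations=200):
    """Power iteration; assumes Q is nonnegative and primitive."""
    m = len(Q)
    v = [1.0] * m
    rho = 0.0
    for _ in range(iterations):
        w = [sum(Q[i][j] * v[j] for j in range(m)) for i in range(m)]
        rho = max(w)                 # current estimate of the dominant eigenvalue
        v = [x / rho for x in w]     # renormalize so that max(v) == 1
    return rho

lengths = [1, 1, 2, 2]               # codeword lengths for A, B, C, D
P = [                                # assumed transition matrix of the example
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.25, 0.25, 0.25, 0.25],
    [0.25, 0.25, 0.25, 0.25],
]
P_free = [[0.25] * 4 for _ in range(4)]   # hypothetical unconstrained source

Q = build_Q(P, lengths)
Q_free = build_Q(P_free, lengths)
print(spectral_radius(Q))        # approximately 1.0: the condition holds with equality
print(spectral_radius(Q_free))   # approximately 1.5: equals the Kraft sum
```

For the constrained source the spectral radius is 1, so the necessary condition is met even though the Kraft sum of the same lengths is 3/2; for the unconstrained matrix the two quantities coincide.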

Proof: The proof is very similar to Karush's proof of
McMillan's theorem [12]. Let $\mathcal{X}^{(k)}$ be the set of all sequences
of $k$ symbols that can be produced by the source and let $L =
[D^{-l_1}, D^{-l_2}, \ldots, D^{-l_m}]$. For $k > 0$, define the row vector

$$V^{(k)} = L Q^{k-1}.$$

It is easy to see by induction that the $i$-th component of $V^{(k)}$
can be written as

$$V^{(k)}_i = \sum_{h_1, h_2, \ldots, h_k} D^{-l_{h_1} - l_{h_2} - \cdots - l_{h_k}},$$

where the sum runs over all sequences of indices
$(h_1, h_2, \ldots, h_k)$ with varying $h_1, h_2, \ldots, h_{k-1}$ and $h_k = i$
such that $(h_1, h_2, \ldots, h_k) \in \mathcal{X}^{(k)}$. So, calling $\mathbf{1}_m$ the vector
composed of $m$ 1's, we have

$$L Q^{k-1} \mathbf{1}_m^T = \sum_{(h_1, h_2, \ldots, h_k) \in \mathcal{X}^{(k)}} D^{-l_{h_1} - l_{h_2} - \cdots - l_{h_k}}.$$

Reindexing the sum with respect to the total length $r = l_{h_1} +
l_{h_2} + \cdots + l_{h_k}$ and calling $N(r)$ the number of sequences of
$\mathcal{X}^{(k)}$ which are mapped into a code of length $r$, we have

$$L Q^{k-1} \mathbf{1}_m^T = \sum_{r=1}^{k\, l_{\max}} N(r) D^{-r},$$

where $l_{\max}$ is the maximum of the values $l_i$, $i = 1, 2, \ldots, m$.
Since the code is uniquely decodable for the source $\mathbf{X}$, there
are at most $D^r$ source-compatible sequences with a code of
length $r$, that is, $N(r) \le D^r$. Hence, for every $k > 0$,

$$L Q^{k-1} \mathbf{1}_m^T \le k\, l_{\max}. \qquad (2)$$

Now, note that the irreducible matrix $Q$ is also nonnegative.
Thus, by the Perron-Frobenius theorem (see [13] for details),
its spectral radius $\rho(Q)$ is also an eigenvalue, with algebraic
multiplicity 1 and with positive associated left eigenvector.

Suppose now $\rho(Q) > 1$. Since $L$ and $\mathbf{1}_m$ are both positive,
it is easy to deduce that the term on the left hand side of
equation (2) asymptotically grows as $\rho(Q)^{k-1}$ when $k$ goes
to infinity. On the contrary, the right hand side term only grows
linearly with $k$ and, for large enough $k$, equation (2) cannot
hold. We conclude that $\rho(Q) \le 1$. ■
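The counting identity at the heart of the proof can be checked numerically on the example of Section II. The sketch below computes $L Q^{k-1} \mathbf{1}_m^T$ both by repeated vector-matrix products and by direct enumeration of the producible sequences, and then checks the bound $k\, l_{\max}$. It assumes $D = 2$, the alternative code lengths, and the successor table `NEXT` (A followed only by A or C, B only by B or D); all names are ours.

```python
from itertools import product

SYMS = "ABCD"
LENGTHS = {"A": 1, "B": 1, "C": 2, "D": 2}
NEXT = {"A": "AC", "B": "BD", "C": "ABCD", "D": "ABCD"}   # assumed constraints

def lhs_matrix(k):
    """L Q^(k-1) 1^T computed by repeated vector-matrix products."""
    v = {s: 2.0 ** -LENGTHS[s] for s in SYMS}            # the row vector L
    for _ in range(k - 1):
        # (v Q)_b = 2^(-l_b) * sum of v_a over the symbols a that may precede b
        v = {b: 2.0 ** -LENGTHS[b] * sum(v[a] for a in SYMS if b in NEXT[a])
             for b in SYMS}
    return sum(v.values())

def lhs_enumeration(k):
    """Sum of 2^(-code length) over all producible sequences of k symbols."""
    total = 0.0
    for seq in product(SYMS, repeat=k):
        if all(b in NEXT[a] for a, b in zip(seq, seq[1:])):
            total += 2.0 ** -sum(LENGTHS[s] for s in seq)
    return total

L_MAX = max(LENGTHS.values())
for k in range(1, 9):
    assert abs(lhs_matrix(k) - lhs_enumeration(k)) < 1e-12
    assert lhs_matrix(k) <= k * L_MAX
```

In this example $L$ happens to be a left eigenvector of $Q$ with eigenvalue 1, so the left hand side stays at 3/2 for every $k$. Without the source constraint the same sum would be $(3/2)^k$, which exceeds $k\, l_{\max} = 2k$ from $k = 7$ onward.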

IV. Some Additional Remarks 

Remark 1 (Theorem 2 generalizes Theorem 1): In the case
of unconstrained Markov sources, Theorem 2 is equivalent to
Theorem 1. Indeed, the Markov source being not constrained
means that its transition probability matrix $P$ has all strictly
positive entries. This implies that the matrix $Q$ defined in
Theorem 2 has all equal rows. The spectral radius of such
a matrix equals the sum of the elements in every row, which
is $\sum_j D^{-l_j}$, reducing thus to the classic Kraft's inequality.

Remark 2 (Non sufficiency of the condition): Kraft's
inequality is both a necessary and sufficient condition for
the existence of a uniquely decodable code (in the sense of
Definition 2) with codeword lengths $l_i$. Theorem 2, instead,
only gives a necessary condition on the lengths $l_i$ for the
unique decodability of a code for a given source. It is easy to
show that the condition stated in the theorem is not a sufficient
condition for the existence of a uniquely decodable code for
a source with codeword lengths $l_i$. Finding a necessary and
sufficient condition seems to be a much harder problem.

Remark 3 (Extended Sardinas-Patterson test): With
respect to the previous remark, we point out that it is however
possible to test a given code for decodability for a given
source by devising a generalization of the Sardinas-Patterson
test [14] to deal with constrained sources (see [15]).

Remark 4 (A more general form of Theorem 2): Theorem
2 was formulated for the case of Markov chains "in the
Moore form", as considered for example in [5]. In other
words, we have modeled information sources as Markov
chains by assigning an output source symbol to every state.
In order to deal with more general sources we can consider
Markov sources in the "Mealy form", where output symbols
are not associated to states but to transitions between states
(which corresponds to the Markov source model used by
Shannon in [2] or, for example, by Gallager in [6]). Theorem
2 can be extended to this type of Markov sources as follows
(see [15]).

Theorem 3: Let $\mathbf{X}$ be a finite state source, with possible
states $S_1, S_2, \ldots, S_q$ and with output symbols in the alphabet
$\mathcal{X} = \{1, 2, \ldots, m\}$. Let $W = \{w_1, \ldots, w_m\}$ be a set of
$D$-ary codewords for the symbols in $\mathcal{X}$ with lengths $l_1, l_2, \ldots, l_m$.
Let $O_{ij}$ be the subset of $\mathcal{X}$ of possible symbols output by
the source when transiting from state $S_i$ to state $S_j$, $O_{ij}$ being
the empty set if the transition from $S_i$ to $S_j$ is impossible. If the
code is uniquely decodable for the source $\mathbf{X}$, then the matrix
$Q$ defined by

$$Q_{ij} = \sum_{h \in O_{ij}} D^{-l_h}$$

has spectral radius at most 1.
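As an illustration of the Mealy form, the example source of Section II can be described with three states instead of four. This three-state description is our own construction, not taken from the references: state `a` means the last symbol was A, `b` that it was B, and `f` that it was C or D, so that any symbol may follow. The sketch builds $Q$ from the sets $O_{ij}$ and estimates its spectral radius by power iteration, assuming $Q$ is nonnegative and primitive.

```python
LENGTHS = {"A": 1, "B": 1, "C": 2, "D": 2}

# Hypothetical three-state Mealy description of the example source.
STATES = ["a", "b", "f"]
OUTPUTS = {   # OUTPUTS[(i, j)] = set O_ij of symbols emitted on transition i -> j
    ("a", "a"): {"A"}, ("a", "f"): {"C"},
    ("b", "b"): {"B"}, ("b", "f"): {"D"},
    ("f", "a"): {"A"}, ("f", "b"): {"B"}, ("f", "f"): {"C", "D"},
}

def mealy_Q(states, outputs, lengths, D=2):
    """Q[i][j] = sum over h in O_ij of D**(-l_h); 0 for impossible transitions."""
    return [[sum(D ** -lengths[h] for h in outputs.get((i, j), ()))
             for j in states] for i in states]

def spectral_radius(Q, iterations=200):
    """Power iteration; assumes Q is nonnegative and primitive."""
    m = len(Q)
    v = [1.0] * m
    rho = 0.0
    for _ in range(iterations):
        w = [sum(Q[i][j] * v[j] for j in range(m)) for i in range(m)]
        rho = max(w)
        v = [x / rho for x in w]
    return rho

Q = mealy_Q(STATES, OUTPUTS, LENGTHS)
print(Q)                    # [[0.5, 0, 0.25], [0, 0.5, 0.25], [0.5, 0.5, 0.5]]
print(spectral_radius(Q))   # approximately 1.0, as for the four-state Moore form
```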

Remark 5 (Shannon's insight):
A historical analysis reveals that both McMillan's theorem
and the proposed generalized one in the form of Theorem 3
are mathematically equivalent to a formulation obtained by
Shannon already in [2, Part I, Sec. 1] for the evaluation of
the capacity of discrete noiseless channels. In particular, in
[2] Shannon established that the capacity of an unconstrained
noiseless channel with symbol durations $t_1, t_2, \ldots, t_m$ is given
by the value $\log X_0$, where $X_0$ is the largest real solution of
the equation

$$X^{-t_1} + X^{-t_2} + \cdots + X^{-t_m} = 1.$$

It is not difficult to show that McMillan's theorem is equivalent
to the obvious statement that the capacity of a $D$-ary channel
is at most $\log D$.

Furthermore, Shannon generalized the capacity formula to
the case of noiseless finite state channels, by stating the
following [2, Th. 1]:

Theorem 4 (Shannon, [2]): Let $b_{ij}^{(s)}$ be the duration
of the $s$-th symbol which is allowable in state $i$ and
leads to state $j$. Then the channel capacity $C$ is equal
to $\log W_0$ where $W_0$ is the largest real root of the
determinant equation:

$$\left| \sum_s W^{-b_{ij}^{(s)}} - \delta_{ij} \right| = 0.$$

As for the unconstrained case, it is possible to show that
Theorem 3 is equivalent to the statement that every finite state
$D$-ary channel has capacity at most $\log D$.
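The link between Kraft's inequality and the capacity of an unconstrained noiseless channel can be explored numerically by treating codeword lengths as symbol durations. The sketch below, which is only illustrative, finds the largest real root $X_0$ of $\sum_i X^{-t_i} = 1$ by bisection and returns $\log_2 X_0$. The function name, the tolerance and the choice of bisection are ours.

```python
from math import log2

def noiseless_capacity(durations, tol=1e-12):
    """log2 of the largest real root X0 of sum_i X**(-t_i) = 1."""
    def f(x):
        return sum(x ** -t for t in durations) - 1.0

    lo, hi = 1.0, 2.0
    while f(hi) > 0:          # f is decreasing for x > 0; grow hi until f(hi) <= 0
        hi *= 2.0
    while hi - lo > tol:      # bisection on the bracket [lo, hi]
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return log2((lo + hi) / 2.0)

print(noiseless_capacity([1, 2, 3, 3]))   # about 1.0: Kraft sum is exactly 1
print(noiseless_capacity([1, 1, 2, 2]))   # about 1.45: Kraft sum is 3/2
```

With binary codeword lengths as durations, a set of lengths satisfies Kraft's inequality exactly when the returned value does not exceed $\log_2 2 = 1$. The lengths of the alternative code give $X_0 = 1 + \sqrt{3}$, that is, a value above 1.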



